Pandas Data Analysis

安装 pip

1
2
3
4
5
6

later use utf-8-sig instead of utf-8

pip install --target=d:\somewhere\other\than\the\default package_name
带空格处理
pip install --target=X:\My~Documents\py_pkg\ selenium

进行分组统计

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
import pandas as pd
df = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
'key2':['one', 'two', 'one', 'two', 'one'],
'data1':np.random.randn(5),
'data2':np.random.randn(5)})
df
data1 data2 key1 key2
0 -0.410673 0.519378 a one
1 -2.120793 0.199074 a two
2 0.642216 -0.143671 b one
3 0.975133 -0.592994 b two
4 -1.017495 -0.530459 a one

grouped = df['data1'].groupby(df['key1'])

grouped.mean()
key1
a -1.182987
b 0.808674
dtype: float64

means = df['data1'].groupby([df['key1'], df['key2']]).mean()
means.unstack()
key2 one two
key1
a -0.714084 -2.120793
b 0.642216 0.975133

选取两个dataframe中列公共的元素

有df1.ID, df2.ID 两列,我需要选取两列中都有的元素。
可以用如下方法:

1
2
3
4
5
6
7
import pandas as pd
m_df=pd.merge(df1,df2,how="inner",on="ID")
ind=m_df.ID
df1.set_index('ID',inplace=True)
df2.set_index('ID',inplace=True)
common_df1=df1.loc[ind,:]
common_df2=df2.loc[ind,:]

改变pandas 列的类型

把数值型的改变成string

1
2
3
4
5
6
7
import pandas as pd
df = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
'key2':['one', 'two', 'one', 'two', 'one'],
'data1':np.random.randn(5),
'data2':np.random.randn(5)})

df["data1"]=df["data1"].astype(str)

寻找缺失值或者NaN值所带的位置

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
                   a         b
2011-01-01 00:00:00 1.883381 -0.416629
2011-01-01 01:00:00 0.149948 -1.782170
2011-01-01 02:00:00 -0.407604 0.314168
2011-01-01 03:00:00 1.452354 NaN
2011-01-01 04:00:00 -1.224869 -0.947457
2011-01-01 05:00:00 0.498326 0.070416
2011-01-01 06:00:00 0.401665 NaN
2011-01-01 07:00:00 -0.019766 0.533641
2011-01-01 08:00:00 -1.101303 -1.408561
2011-01-01 09:00:00 1.671795 -0.764629


import numpy as np
index = df['b'].index[df['b'].apply(np.isnan)]

df['a'].ix[index[0]]
>>> 1.452354

df_index = df.index.values.tolist()
[df_index.index(i) for i in index]
>>> [3, 6]

iloc 与 loc 区别 at 与 iat 区别 ix

dataframe.iloc按下标选取,或者使用dataframe.loc按索引选取:

如果不是需要访问特定行列,而只是某个特殊位置的元素的话,dataframe.at和dataframe.iat是最快的方式,它们分别用于使用索引和下标进行访问:

ataframe.ix可以混合使用索引和下标进行访问,唯一需要注意的地方是行列内部需要一致,不可以同时使用索引和标签访问行或者列,不然的话,将会得到意外的结果。

pandas 处理缺失数据

原始数据的中很可能存在一些数据的缺失,就如同现在处理的这个样例数据一样,处理缺失数据有多种方式。通常使用dataframe.dropna(),dataframe.dropna()可以按行丢弃带有nan的数据;若指定how=’all’(默认是’any’),则只在整行全部是nan时丢弃数据;若指定thresh,则表示当某行数据非缺失列数超过指定数值时才保留;要指定根据某列丢弃可以通过subset完成。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
print "Data size before filtering:"
print df.shape

print "Drop all rows that have any NaN values:"
print "Data size after filtering:"
print df.dropna().shape
print df.dropna().head(10)

print "Drop only if all columns are NaN:"
print "Data size after filtering:"
print df.dropna(how='all').shape
print df.dropna(how='all').head(10)

print "Drop rows who do not have at least six values that are not NaN"
print "Data size after filtering:"
print df.dropna(thresh=6).shape
print df.dropna(thresh=6).head(10)

print "Drop only if NaN in specific column:"
print "Data size after filtering:"
print df.dropna(subset=['closePrice']).shape
print df.dropna(subset=['closePrice']).head(10)